ABSTRACT

I wish to propose a quite speculative new version of the grandmother cell theory to 
explain how the brain, or parts of it, may work. In particular, I discuss how the visual 
system may learn to recognize 3D objects. The model would apply directly to the cortical 
cells involved in visual face recognition. I will also outline the relation of our theory to 
existing models of the cerebellum and of motor control. Specific biophysical mechanisms can 
be readily suggested as part of a basic type of neural circuitry that can learn to approximate 
multidimensional input-output mappings from sets of examples and that is expected to be 
replicated in different regions of the brain and across modalities. The main points of the
theory are: 

• the brain uses modules for multivariate function approximation as basic components of 
several of its information processing subsystems. 

• these modules are realized as HyperBF networks (Poggio and Girosi, 1990a,b). 

• HyperBF networks can be implemented in terms of biologically plausible mechanisms 
and circuitry. 

The theory predicts a specific type of population coding that represents an extension of 
schemes such as look-up tables. I will conclude with some speculations about the trade-off 
between memory and computation and the evolution of intelligence. 
1 The Grandmother Neuron Theory

A classical theme in the neurophysiological literature at least since the 
work of Hubel and Wiesel (1962) is the idea of information processing in 
the brain as leading to "grandmother" neurons responding selectively to 
the precise combination of visual features that are associated with one's 
grandmother. The "grandmother" neuron theory is of course not restricted 
to vision and applies as well to other sensory modalities and even to motor
control, in the form of cells corresponding to elemental movements.
Why is this idea so attractive? The idea is attractive because of its sim- 
plicity: it replaces complex information processing with the superficially 
simpler task of accessing a memory. The problem of recognition and motor 
control would be solved by simply accessing look-up tables containing ap- 
propriate descriptions of objects and of motor actions. The human brain 
can probably exploit a vast amount of memory with its 10^14 or so synapses,
making attractive any scheme that replaces computation with memory. In 
the case of vision the apparent simplicity of this solution hides the diffi- 
cult problems of an appropriate representation of an object and of how to 
extract it from complex images. But even assuming that these problems 
of representation, feature extraction and segmentation could be solved by 
other mechanisms, a fundamental difficulty seems to be intrinsic to the 
"grandmother" cell idea. The difficulty consists of the combinatorial ex- 
plosion in the number of cells that any scheme of the look-up table type 
would reasonably require for either vision or motor control. In the case 
of 3D object recognition, for instance, there should be for each object as 
many entries in the look-up table as there are 2-D views of the object, in 
principle an infinite number. 

The difficulty of a combinatorial explosion lies at the heart of theories 
of intelligence that attempt to replace information processing with look- 
up tables of precomputed results. In this paper we suggest a scheme that 
avoids the combinatorial problem, while retaining the attractive features of 
the look-up table. The basic idea is to use only a few entries and interpolate 
or approximate among them. A mathematical theory based on this idea 
leads to a powerful scheme of learning from examples that is equivalent 
to a parallel network of simple processing elements. The scheme has an 
intriguingly simple implementation in terms of plausible biophysical mech- 
anisms. We will discuss in particular the case of 3D object recognition 
but will propose that the scheme is possibly used by the brain for several 
different information processing tasks. Many information processing prob- 
lems can be represented as the composition of one or more multivariate 



functions that map an input signal into an output signal in a smooth way. 
These modules could be synthesized from a sufficient set of input-output 
pairs - the examples - by the scheme described here. Because of the power 
and general applicability of this mechanism, we speculate that a part of 
the machinery of the brain - including perhaps some of the cortical cir- 
cuitry which is somewhat similar across the different modalities - may be 
dedicated to the task of function approximation. 



2 How to Synthesize through Learning the 
Basic Approximation Module: Regular- 
ization Networks 

This section describes a technique for synthesizing the approximation mod- 
ules discussed above through learning from examples. I first explain how 
to rephrase the problem of learning from examples as a problem of ap- 
proximating a multivariate function. The material in this section is from 
Poggio and Girosi (1989, 1990a, 1990b), where more details can be found. 

To illustrate the connection, let us draw an analogy between learning an 
input-output mapping and a standard approximation problem, 2-D surface 
reconstruction from sparse data points. Learning simply means collecting 
the examples, i.e., the input coordinates x_i, y_i and the corresponding output
values at those locations, the heights of the surface d_i. Generalization
means estimating d at locations x,y where there are no examples, i.e., no 
data. This requires interpolating or, more generally, approximating the 
surface (i.e., the function) between the data points (interpolation is the 
limit of approximation when there is no noise in the data). In this sense, 
learning is a problem of hypersurface reconstruction (Poggio et al., 1988, 
1989; Omohundro, 1987). 

From this point of view, learning a smooth mapping from examples is 
clearly ill-posed, in the sense that the information in the data is not suf- 
ficient to reconstruct uniquely the mapping at places where data are not 
available. In addition, the data are usually noisy. A priori assumptions 
about the mapping are needed to make the problem well-posed. One of 
the simplest assumptions is that the mapping is smooth: small changes in 
the inputs cause a small change in the output. Techniques that exploit
smoothness constraints in order to transform an ill-posed problem into a
well-posed one are well known under the name of regularization theory, and
have interesting Bayesian applications (Tikhonov and Arsenin, 1977; Poggio,
Torre and Koch, 1985; Bertero, Poggio and Torre, 1988). We have
recently shown that the solution to the approximation problem given by 
regularization theory can be expressed in terms of a class of multilayer net- 
works that we call regularization networks or Hyper Basis Functions (see 
Fig. 1). Our main result (Poggio and Girosi, 1989) is that the regulariza- 
tion approach is equivalent to an expansion of the solution in terms of a 
certain class of functions: 

f(x) = \sum_{i=1}^{N} c_i G(x; \xi_i) + p(x)    (1)

where G(x) is one such function and the coefficients c_i satisfy a linear
system of equations that depend on the N "examples," i.e., the data to 
be approximated. The term p(x) is a polynomial that depends on the 
smoothness assumptions. In many cases it is convenient to include up 
to the constant and linear terms. Under relatively broad assumptions, 
the Green's function G is radial and therefore the approximating function 
becomes: 

f(x) = \sum_{i=1}^{N} c_i G(||x - \xi_i||^2) + p(x),    (2)

which is a sum of radial functions, each with its center \xi_i on a distinct
data point and of constant and linear terms (from the polynomial, when 
restricted to be of degree one). The number of radial functions, and corre- 
sponding centers, is the same as the number of examples. 

The interpretation of Eq. 2 is simple: for instance, in the 2D case -
in which the examples correspond to points of the x,y plane where the
height of the surface is known and generalization corresponds to estimating
the height of the surface at a point in the plane where data are not
available - the surface is approximated by the superposition of, say, several
two-dimensional Gaussian distributions, each centered on one of the data
points.
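
The surface-reconstruction reading of Eq. 2 can be illustrated with a minimal sketch. It assumes a Gaussian G with a hand-picked width `sigma`, omits the polynomial term p(x), and uses made-up example points; the function names `fit_rbf` and `eval_rbf` are hypothetical and not taken from the cited papers.

```python
import numpy as np

def gaussian(r2, sigma=1.0):
    """Gaussian radial function of the squared distance r2."""
    return np.exp(-r2 / sigma**2)

def fit_rbf(X, d, sigma=1.0):
    """Solve the linear system G c = d, with one Gaussian centered on each example."""
    r2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.linalg.solve(gaussian(r2, sigma), d)

def eval_rbf(Xnew, X, c, sigma=1.0):
    """Evaluate f(x) = sum_i c_i G(||x - x_i||^2) at the rows of Xnew."""
    r2 = ((Xnew[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return gaussian(r2, sigma) @ c

# Five illustrative examples: points in the x,y plane with known heights d_i.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
d = np.array([0.0, 1.0, 1.0, 2.0, 1.0])

c = fit_rbf(X, d, sigma=0.7)
heights_at_examples = eval_rbf(X, X, c, sigma=0.7)                      # reproduces d
height_new = eval_rbf(np.array([[0.25, 0.75]]), X, c, sigma=0.7)        # generalization
```

With noise-free data the fitted surface passes through every example, which is the interpolation limit mentioned earlier; a regularized variant would add a small multiple of the identity to the matrix before solving.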

Our derivation shows that the type of basis functions depends on the 
specific a priori assumption of smoothness. Depending on it we obtain the 
Gaussian G(r) = e^{-(r/c)^2}, the well known "thin plate spline" G(r) = r^2 ln r,
and other specific functions, radial and not. As observed by Broomhead 
and Lowe (1988) in the radial case, a superposition of functions like Eq. 1
is equivalent to a network of the type shown in Fig. 1b.

The network associated with Eq. 2 can be made more general in terms 
of the following extension 



f^*(x) = \sum_{\alpha=1}^{n} c_\alpha G(||x - t_\alpha||_W^2) + p(x)    (3)

where the parameters t_\alpha, that we call "centers," and the coefficients c_\alpha are
unknown, and are in general much fewer than the data points (n < N).
The norm is a weighted norm

||x - t_\alpha||_W^2 = (x - t_\alpha)^T W^T W (x - t_\alpha)    (4)

where W is an unknown square matrix and the superscript T indicates the 
transpose. In the simple case of diagonal W the diagonal elements w_i assign
a specific weight to each input coordinate, determining in fact the units of 
measure and the importance of each feature (the matrix W is especially 
important in cases in which the input features are of a different type and 
their relative importance is unknown). Equation 3 can be implemented by 
the network of Fig. 1. A sigmoid function at the output may sometimes 
be useful without increasing the complexity of the system (see Poggio and 
Girosi, 1989). Notice that there could be more than one set of Green's 
functions, for instance a set of multiquadrics and a set of Gaussians, each 
with its own W. Two or more sets of Gaussians, each with a diagonal W, 
are equivalent to sets of Gaussians with their own \sigma values.
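
As a rough illustration of Eqs. 3 and 4, the sketch below evaluates a HyperBF unit layer with a Gaussian G, a small number of centers, and a diagonal W. The polynomial term p(x) is left out, and all numerical values as well as the function name `hyperbf` are assumptions chosen only for the example.

```python
import numpy as np

def hyperbf(x, centers, coeffs, W):
    """f*(x) = sum_a c_a G(||x - t_a||_W^2) with Gaussian G(r2) = exp(-r2); p(x) omitted."""
    diff = (x - centers) @ W.T        # row a holds W (x - t_a)
    r2 = (diff ** 2).sum(axis=1)      # (x - t_a)^T W^T W (x - t_a)
    return coeffs @ np.exp(-r2)

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # n = 2 centers t_a
coeffs = np.array([1.0, -0.5])                 # coefficients c_a
W = np.diag([2.0, 0.5])                        # diagonal W: one weight per input coordinate

value_at_center = hyperbf(np.array([0.0, 0.0]), centers, coeffs, W)
```

Here the first coordinate counts four times as much as the second in the distance (weights 2.0 versus 0.5), which is the sense in which W sets the units of measure and the importance of each feature.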

2.1 Learning 

In the framework of the previous section the stage of learning is simply 
the stage of estimating from the data - the examples - the values of the 
parameters in the representation we have derived, i.e., Eqs. (3) and (4). Iterative
methods can be used to find the optimal values of the various sets of
parameters, the c_\alpha, the w_i and the t_\alpha, that minimize an error functional
on the set of examples. Steepest descent is the standard approach that 
requires calculations of derivatives. An even simpler method that does not 
require calculation of derivatives (suggested and found surprisingly efficient 
in preliminary work by Caprile and Girosi, personal communication) is to 
look for random changes (controlled in appropriate ways) in the parameter 
values that reduce the error. We define the error functional - also called 
energy - as 

H[f^*] = H_{c,t,W} = \sum_{i=1}^{N} (\Delta_i)^2,

with

\Delta_i = y_i - f^*(x_i) = y_i - \sum_{\alpha=1}^{n} c_\alpha G(||x_i - t_\alpha||_W^2).

In the first method the values of c_\alpha, t_\alpha and W that minimize H[f*]
are regarded as the coordinates of the stable fixed point of the following
dynamical system:

\dot{c}_\alpha = -\omega \frac{\partial H[f^*]}{\partial c_\alpha},    \alpha = 1, ..., n

\dot{t}_\alpha = -\omega \frac{\partial H[f^*]}{\partial t_\alpha},    \alpha = 1, ..., n

\dot{W} = -\omega \frac{\partial H[f^*]}{\partial W}

where \omega is a parameter. The derivatives are rather complex (see Poggio
and Girosi, 1990a and Notes section). 

The second method is simpler: random changes in the parameters are 
made and accepted if H[f*] decreases. Occasionally, changes that increase
H[f*] may also be accepted (similarly to the Metropolis algorithm). 
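
The second method can be sketched as follows for a one-dimensional input. This is only an illustration under stated assumptions: Gaussian perturbations of all parameters, a single scalar norm weight w, and an optional Metropolis-style acceptance controlled by a hypothetical `temperature` argument. It is not the procedure of Caprile and Girosi, whose details are not given here.

```python
import numpy as np

def energy(theta, x, y, n):
    """H = sum_i (y_i - sum_a c_a exp(-(w (x_i - t_a))^2))^2 for scalar inputs."""
    c, t, w = theta[:n], theta[n:2 * n], theta[2 * n]
    G = np.exp(-(w * (x[:, None] - t[None, :])) ** 2)
    return float(((y - G @ c) ** 2).sum())

def random_descent(x, y, n=3, steps=2000, step_size=0.1, temperature=0.0, seed=0):
    rng = np.random.default_rng(seed)
    # Start with zero coefficients, evenly spaced centers and unit norm weight.
    theta = np.concatenate([np.zeros(n), np.linspace(x.min(), x.max(), n), [1.0]])
    h = energy(theta, x, y, n)
    best_theta, best_h = theta.copy(), h
    for _ in range(steps):
        trial = theta + step_size * rng.standard_normal(theta.size)
        h_trial = energy(trial, x, y, n)
        # Always accept improvements; optionally accept some uphill moves.
        uphill_ok = temperature > 0 and rng.random() < np.exp(-(h_trial - h) / temperature)
        if h_trial < h or uphill_ok:
            theta, h = trial, h_trial
            if h < best_h:
                best_theta, best_h = theta.copy(), h
    return best_theta, best_h

x = np.linspace(-2.0, 2.0, 21)
y = np.sin(x)
h_start = energy(np.concatenate([np.zeros(3), np.linspace(-2.0, 2.0, 3), [1.0]]), x, y, 3)
theta, h_final = random_descent(x, y)
```

Because the best parameters seen so far are returned, the final error can never exceed the starting error, even when uphill moves are enabled.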

2.2 Interpretation of the Network 

The interpretation of the network of Fig. 1 is the following. After learn- 
ing, the centers of the basis functions are similar to prototypes, since they 
are points in the multidimensional input space. Each unit computes a 
(weighted) distance of the inputs from its center, that is a measure of their 
similarity, and applies to it the radial function. In the case of the Gaussian, 
a unit will have maximum activity when the new input exactly matches 
its center. The output of the network is the linear superposition of the 
activities of all the basis functions in the network, plus direct, weighted 
connections from the inputs (the linear terms of p(x)) and from a con- 
stant input (the constant term). Notice that in the limit case of the basis 
functions approximating delta functions, the system becomes equivalent to 
a look-up table. During learning the weights c are found by minimizing 
a measure of the error between the network's prediction and each of the 
examples. At the same time, the centers of the radial functions and the 
weights in the norm are also updated during learning. Moving the centers
is equivalent to modifying the corresponding prototypes and corresponds
to task-dependent clustering. Finding the optimal weights W for the norm
is equivalent to transforming appropriately, for instance scaling, the input
coordinates and corresponds to task-dependent dimensionality reduction.
Regularization networks - of which HyperBFs are the most general and
powerful version - represent a general framework for learning smooth map- 
pings that rigorously connects approximation theory, generalized splines 
and regularization with feedforward multilayer networks. They also con- 
tain as special cases the Radial Basis Functions technique (Micchelli, 1986; 
Powell, 1987; Broomhead and Lowe, 1988) and several well-known algo- 
rithms, especially in the pattern recognition literature. 

3 A Proposal for a Biological Implementa- 
tion 

In this section we point out some remarkable properties of Gaussian Hy- 
perBF, that may have implications for neurobiology. 

3.1 Factorizable Radial Basis Functions 

The synthesis of (weighted) radial basis functions in high dimensions may 
be easier if they are factorizable. It is easily seen that the only radial 
basis function which is factorizable is the Gaussian (with diagonal W). A 
multidimensional Gaussian function can be represented as the product of 
lower dimensional Gaussians. For instance a 2D Gaussian radial function 
centered in t can be written as: 

G(||x - t||_W^2) = e^{-||x - t||_W^2} = e^{-(x - t_x)^2 / \sigma_x^2} e^{-(y - t_y)^2 / \sigma_y^2},    (5)

with \sigma_x = 1/w_1 and \sigma_y = 1/w_2, where w_1 and w_2 are the elements of the
matrix W assumed, in this section, to be diagonal.

This dimensionality factorization is especially attractive from the phys- 
iological point of view, since it is difficult to imagine how neurons could 
compute G(||x - t_\alpha||^2). The scheme of figure 2, on the other hand, is
physiologically plausible. Gaussian radial functions in one, two and pos- 
sibly three dimensions can be implemented as receptive fields by weighted 
connections from the sensor arrays (or some retinotopic array of units rep- 
resenting with their activity the position of features). Gaussians in higher 
dimensions can then be synthesized as products of one and two dimensional 
receptive fields. 
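
The factorization can be checked numerically with a small sketch. It assumes the convention G = exp(-||x - t||_W^2) with a diagonal W and spreads equal to the reciprocals of the diagonal weights; the test point, center and weights are arbitrary illustrative values.

```python
import math

def gaussian_2d(x, y, tx, ty, w1, w2):
    """2D Gaussian of the weighted squared distance, for diagonal W = diag(w1, w2)."""
    return math.exp(-((w1 * (x - tx)) ** 2 + (w2 * (y - ty)) ** 2))

def gaussian_1d(u, t, sigma):
    """1D Gaussian receptive field with center t and spread sigma."""
    return math.exp(-((u - t) ** 2) / sigma ** 2)

w1, w2 = 2.0, 0.5
sigma_x, sigma_y = 1.0 / w1, 1.0 / w2

full = gaussian_2d(0.3, -1.2, 0.0, 1.0, w1, w2)
product = gaussian_1d(0.3, 0.0, sigma_x) * gaussian_1d(-1.2, 1.0, sigma_y)
```

The two values agree up to rounding, so a higher-dimensional Gaussian unit can in principle be assembled from one-dimensional receptive fields followed by a multiplication.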

This scheme has three additional interesting features: 



1. the multidimensional radial functions are synthesized directly by ap- 
propriately weighted connections from the sensor arrays, without any 
need of an explicit computation of the norm and the exponential. 

2. 2D Gaussians operating on the sensor array or on a retinotopic array 
of features extracted by some preprocessing transduce the implicit 
position of features in the array into a number (the activity of the 
unit). 

3. 2D Gaussians acting on a retinotopic map can be regarded each as 
representing one 2D "feature," i.e., a component of the input vec- 
tor, while each center represents the "template," resulting from the 
conjunction of those lower-dimensional features. Notice that in this
analogy the radial basis function is the AND of several features and 
could also include the negation of certain features, that is the AND 
NOT of them. W weights the importance of the different features. 

3.2 Biophysical Mechanisms 
3.2.1 The Network 

The multiplication operation required by the previous interpretation of 
Gaussian GRBFs to perform the "conjunction" of Gaussian receptive fields 
is not too implausible from a biophysical point of view. It could be per- 
formed by several biophysical mechanisms (see Koch and Poggio, 1987). 
Here we mention three mechanisms: 

1. inhibition of the silent type and related circuitry (see Torre and Pog- 
gio, 1978; Poggio and Torre, 1978) 

2. the AND-like mechanism of NMDA receptors 

3. a logarithmic transformation, followed by summation, followed by 
exponentiation. The logarithmic and exponential characteristic could 
be implemented in appropriate ranges by the sigmoid-like pre-to- 
postsynaptic voltage transduction of many synapses. 
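
The third mechanism rests on the identity that a product of positive numbers equals the exponential of the sum of their logarithms. A minimal sketch, with made-up activity values and a hypothetical function name, and valid only for strictly positive activities:

```python
import math

def product_via_logs(activities):
    """Multiply positive activities as exp(sum(log a_k))."""
    return math.exp(sum(math.log(a) for a in activities))

receptive_field_outputs = [0.9, 0.5, 0.8]    # illustrative Gaussian unit activities
conjunction = product_via_logs(receptive_field_outputs)
```

For Gaussian receptive fields the logarithm of each activity is simply a negative quadratic in the input, so the summation stage adds squared weighted distances.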

If the first or the second mechanism is used, the product of figure 3 
can be performed directly on the dendritic tree of the neuron representing 
the corresponding radial function (alternatively, each dendritic tree may 
perform pairwise products only, in which case a logarithmic number of 
cells would be required). The scheme also requires a certain amount of 
memory per basis unit, in order to store the center vector. In the case of 



Gaussian receptive fields used to synthesize Gaussian radial basis functions, 
the center vector is effectively stored in the position of the 2D (or 1D)
receptive fields and in their connections to the product unit(s). This is 
plausible physiologically. 

The linear terms (the direct connections from the inputs to the output 
in figure 1) can be realized directly as inputs to the output neuron that 
summates linearly its synaptic inputs (an output nonlinearity is allowed 
and will not change the basic form of the model, see Poggio and Girosi, 
1989). They may also be realized through intermediate linear units. 

3.2.2 Mechanisms for Learning 

Do the update schemes have a physiologically plausible implementation? 
Consider first the steepest descent methods, which require derivatives. 
Equation (6) or a somewhat similar, quasi-Hebbian scheme is not too
unlikely and may require only a small amount of neural circuitry. Equation
(7) seems more difficult to implement for a network of real neurons. 

Methods such as the random descent method, which do not require 
calculation of derivatives are biologically much more plausible and seem 
to perform very well in preliminary experiments. In the Gaussian case, 
with basis functions synthesized through the product of Gaussian receptive 
fields, moving the centers means establishing or erasing connections to the 
product unit. A similar argument can be made also about the learning of 
the matrix W. Notice that in the diagonal Gaussian case the parameters 
to be changed are exactly the \sigma of the Gaussians, i.e., the spread of the
associated receptive fields. Notice also that the \sigma for all centers on one
particular dimension is the same, suggesting that the learning of w_i may
involve the modification of the scale factor in the input arrays rather than 
a change in the dendritic spread of the postsynaptic neurons. 

In all these schemes the real problem consists in how to provide the 
"teacher" input (but see figure 5). 

4 Visual Recognition of 3D Objects and 
Face Sensitive Neurones 

We have recently suggested and demonstrated how to use a HyperBF net- 
work to learn to recognize a 3D object. This section reviews very briefly 
this work (Poggio and Edelman, 1990), where more references can be found, 
and then suggests that the brain may use a similar strategy. Face sensitive 
neurons are discussed as a specific instance. 



4.1 HyperBF Networks for Recognizing 3D Objects 

A 3D object gives rise to an infinite variety of 2D images or views, because 
of the infinite number of possible poses relative to the viewer, and because 
of arbitrarily different illumination conditions. Is it possible to synthesize 
a module that can recognize an object from any viewpoint, after it learns 
its 3D structure from a small set of perspective views? We have
recently shown (Poggio and Edelman, 1990) that the HyperBF scheme 
may provide a solution to the problem provided that relatively stable and 
uniquely identifiable features (that we will call "labeled" features) can be 
extracted from the image. 

In our scheme a view is represented as a 2N vector x_1, y_1, x_2, y_2, ..., x_N, y_N
of the coordinates on the image plane of N labeled and visible feature points 
on the object. We assume that a view of an object is a vector of this type 
(instead of position in the image of feature points we have also used angles 
between corners and length of segments or both), in general augmented by 
components that represent other properties of the object not necessarily 
related to its geometric shape, such as color or texture. We also assume 
that the function that maps the views into {0, 1} (0 if the view is of another
object, 1 if the view is of the correct object) can be approximated by a 
smooth function (if this were false, one could approximate the mapping 
from the view to a "standard" view and then apply a radial function to 
the result, see Poggio and Edelman, 1990). 

The network used for this task is shown in Figure 3 (see also Figure 
4). In the simplest version (fixed centers) the centers correspond to some 
of the examples, i.e., some views of the object. Updating the centers is 
equivalent to modifying the corresponding "prototypical views". Updating 
the weights of the matrix W corresponds to changing the relative impor- 
tance of the various features that define the views of an object. This is 
important in the case in which these features are of a completely different 
type: a large w indicates a larger weight of the feature in the measure of
similarity and is equivalent to a small \sigma in the Gaussian function.
Features with a small role have a very large \sigma: their exact position or value
does not matter much. Of course, the problem the network solves is a 
caricature of the full problem of object recognition: one isolated object, 
without occlusions or noise and moreover with image features assumed to 
be matched to model features (for a similar approach see [2]). Existing
computer vision algorithms for model-based recognition typically deal with 
more complex situations. We think however that the approach described 
here can be extended to more realistic tasks. In a first step in this direction 
we have successfully extended the algorithm to deal with noisy, real, mildly 
self-occluding objects (Brunelli and Poggio, in press).
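
The fixed-centers version of the scheme can be sketched in a few lines. This is a toy illustration under several assumptions that are not from the original work: a random six-point object, orthographic rather than perspective projection, rotation about the vertical axis only, four stored views used as centers, an arbitrary Gaussian width, and training on views of the target object only. All function names are hypothetical.

```python
import numpy as np

def rotation_y(angle):
    """Rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def view(points3d, angle):
    """Rotate the object, project orthographically, and flatten to a 2N vector."""
    rotated = points3d @ rotation_y(angle).T
    return rotated[:, :2].ravel()

def unit_activities(v, centers, sigma):
    """Gaussian response of each stored-view unit to the input view v."""
    r2 = ((v - centers) ** 2).sum(axis=1)
    return np.exp(-r2 / sigma**2)

rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, size=(6, 3))        # N = 6 labeled feature points
train_angles = np.deg2rad([0, 30, 60, 90])
centers = np.stack([view(target, a) for a in train_angles])   # stored views
sigma = 1.0

# Choose coefficients so that the output is exactly 1 on every stored view.
G = np.stack([unit_activities(v, centers, sigma) for v in centers])
coeffs = np.linalg.solve(G, np.ones(len(centers)))

def recognize(v):
    """Network output: weighted sum of the stored-view unit activities."""
    return float(unit_activities(v, centers, sigma) @ coeffs)

score_stored = recognize(centers[1])
score_novel = recognize(view(target, np.deg2rad(45)))   # view between stored ones
```

A fuller version would also include views of other objects with target output 0 and would adapt the centers and the matrix W, as described above.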

An interesting conclusion of this work consists of the small number 
of views that is required to recognize an object from the infinite number 
of possible views. The results clearly show that the scheme avoids the 
main problem of look-up table schemes, the explosion in the number of 
entries. Furthermore, the performance of the HyperBF recognition scheme 
resembles human performance in a related task. As discussed in Poggio 
and Edelman (1990), the number of training views necessary to achieve 
an acceptable recognition rate on novel views, 80-100 for the full viewing 
sphere, is broadly compatible with the finding that people have trouble 
recognizing a novel wire-frame object previously seen from one viewpoint 
if it is rotated away from that viewpoint by about 30° (it takes 72 patches
of 30° x 30° to cover the viewing sphere).
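
The figure of 72 follows from simple counting, assuming the viewing sphere is tiled on a grid of azimuth and elevation angles (the patches are then equal in angular extent, not in area):

```python
patch_deg = 30
azimuth_patches = 360 // patch_deg       # 12 steps around the object
elevation_patches = 180 // patch_deg     # 6 steps from pole to pole
total_patches = azimuth_patches * elevation_patches
```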

Recently, Bülthoff and Edelman (1990; see also references therein) have
obtained interesting psychophysical results that support this model for human
recognition of a certain class of 3D objects against other possible models. 
In general, the experimental results fit closely the prediction of theories of 
the 2D interpolation variety and appear to contradict theories that involve 
3D models. 

4.2 Face Sensitive Neurons 

The HyperBF recognition scheme we have outlined has suggestive simi- 
larities with some of the data about visual neurons responding to faces 
obtained by Perrett and coworkers recording from the temporal associa- 
tion cortex (see Perrett et al., 1987 and references therein, Poggio and 
Edelman, 1990). Let us consider the network of figure 3 as the skeleton for 
a model of the circuitry involved in the recognition of faces. One expects
different modules of the type shown in Figure 3, one for each different
object. One also expects hierarchical organizations: for instance a network
of the HyperBF type may be used to recognize certain types of eyes
and then may serve as input to another network involved in recognizing a 
certain class of faces, which may be itself one of the inputs to a network for 
a specific face. Different types of cells may then be expected. The overall 
output of a network for a specific face may be identified with the behavioral 
responses associated with recognition and may or may not coincide with an 
individual neuron. There should be cells or parts of cells corresponding to 
the centers, i.e., to the prototypes used by the networks. The response of 
these neurons should be a Gaussian function of the distance of the input to 
the template. These units would be somewhat similar to "grandmother" 
filters with a graded response, rather than binary detectors, each repre-
senting a prototype. They would be synthesized as the conjunction of, for 
instance, two-dimensional Gaussian receptive fields looking at a retinotopic 
map of features. During learning, the weights of the various prototypes in 
the network output are modified to find the optimal values that minimize 
the overall error. The prototypes themselves are slowly changed to find
optimal prototypes for the task. The weights of the different input features
are also modified to perform task-dependent dimensionality reduction.

Some of these expectations are consistent with the experimental
findings of Perrett et al. (1987). Some of the neurons described have several
of the properties expected from the units of a HyperBF network with a 
center, i.e., a prototype that corresponds to a view of a specific face. 

Some of the Main Data (from Perrett et al., 1987 and references therein)

• The majority of cells responsive to faces are sensitive to the general
characteristics of the face and they are somewhat invariant to its
exact position and attitude.

• Presenting parts of the face in isolation revealed that some of the
cells responded to different subsets of features: some cells are more
sensitive to parts of the face such as eyes or mouth.

• There are cells selective for a particular view of the head. Some
cells were maximally sensitive to the front view of a face, and their
response fell off as the head was rotated into the profile view, and
others were sensitive to the profile view with no response to the front
view of the face.

• There are cells that are specific to the views of one individual. It
seems that for each known person there would be a set of 'face
recognition units'. Our model applies most directly to these neurons.

5 Theories of the Cerebellum and of Motor 
Control 

5.1 Marr's and Albus' Models of the Cerebellum

The cerebellum is a part of the brain that is important in the coordination 
of complex muscle movements. The neural organization of the cerebellum 
is highly regular and well known (see Figure 5). Marr (1969) and Albus 
(1972) modeled the cerebellum as a look-up table. The critical part of
their theories is the assumption that the synapses between the parallel 
fibers and the Purkinje cells are modified as a function of the Purkinje cell 
activity and the climbing fibers input. I suggest (see figure 5) that the 
cerebellum is a HyperBF network or set of networks (one for each Purkinje 
cell). Instead of a simple look-up table, the cerebellum would be a function 
approximation module (in a sense, "an approximating look-up table"). In 
our conjecture, basket and Golgi cells would have different roles from the 
roles assumed in the Marr-Albus theory. In particular, the Golgi cells, 
which receive inputs from the parallel fibers and whose axons synapse on 
the granule cells-mossy fibers clusters, may be used to change the norm 
weights W. 

Key Assumptions 

• granule cells correspond to basis units (there may be as many as
200,000 granule cells per Purkinje cell) representing as many "examples"

• Purkinje cells are the outputs of the network

• climbing fibers are responsible for modifying synapses from granule
cells to the Purkinje cell.



5.2 Theories of Motor Control 

There are at least two aspects of motor control in which HyperBF modules
could be used:

• to compute smooth, time-dependent trajectories - for instance arm
trajectories - given sparse points such as initial, final and
intermediate positions.

• to associate to each position in the trajectory the appropriate field
of muscle forces.

These two problems may be solved by two modules that can be used in 
series, the first one providing the input to the second one (see figure 6a 
and 6b). I will first consider the problem of computing appropriate smooth
trajectories from sparse points in space-time. An interesting question is: 
are HyperBFs a plausible implementation for Flash and Hogan's minimum 
jerk principle for the coordination of arm movements? Flash and Hogan 
(1985) found experimental evidence that arm trajectories minimize jerk,
i.e., C = ||x^{(3)} + y^{(3)}||^2, where x^{(3)} is the third temporal derivative of x.
This suggests a regularization principle with a stabilizer corresponding to 
additive quintic splines. HyperBF could implement it using basis units 
recruited for the specific motion (as many as there are constrained points) 
with Gaussian-like or spline-like time-dependent activities (boundary con- 
ditions may have to be taken into account). The weights would be learned 
during training. As Morasso and Mussa-Ivaldi (1982) implied,
approximation schemes of this type amount to composition of elemental move-
ments. It is interesting to observe that jerk is automatically minimized 
by the linear superposition of the appropriate elemental movements, i.e., 
the appropriate Green's functions. Thus a scheme of the Morasso and
Mussa-Ivaldi type can be made to be perfectly equivalent to the Flash-Hogan
minimization principle. The fact that the minimum jerk principle can be 
implemented directly by a HyperBF network is attractive from the point 
of view of a biological implementation since biologically implausible direct 
minimization procedures are not required anymore. The minimization is 
implicit in the form of the elemental movements; weighted superposition 
of the elemental movements seems a much easier operation to implement 
in the motor system than explicit minimization. 
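
For the simplest case of a point-to-point movement that starts and ends at rest, the minimum-jerk trajectory has a well-known closed form, a quintic polynomial in normalized time. The sketch below uses that standard result for one coordinate; it is given as an illustration of the quintic character of the solution, not as the network implementation discussed here, and the function name is hypothetical.

```python
def minimum_jerk(x0, xf, t, duration):
    """Position at time t on the minimum-jerk path from rest at x0 to rest at xf."""
    tau = t / duration                       # normalized time in [0, 1]
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

# Sample a unit-length, unit-duration movement at 11 equally spaced times.
samples = [minimum_jerk(0.0, 1.0, k / 10, 1.0) for k in range(11)]
```

The profile is symmetric about the midpoint of the movement, with zero velocity and acceleration at both ends; intermediate via-points would require piecing together several such quintic segments.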

The second problem requires a neural circuit that associates an equi- 
librium position to an appropriate activation. Bizzi (see for instance Bizzi 
et al., 1990) suggests that a group of spinal cord interneurons specify the 
limb's final position and configuration through a field of muscle forces that 
have the appropriate equilibrium point. Bizzi et al. (1990) propose that 
the spinal cord contains aspects of motor behavior reminiscent of a look- 
up table. Their findings extend several results in the area of oculomotor 
research, where investigators have described neural structures whose acti- 
vation brings the eyes or the head to a unique position. I suggest that the 
required look-up table behavior may be implemented through a HyperBF 
module that requires the storage of only a few equilibrium positions (or
correspondingly a few conservative-like fields, i.e., appropriate activation 
coefficients for the motoneurons) and can interpolate between them (see 
figure 6). Notice that the motor system could synthesize a conservative 
field of muscle forces by superposing (with arbitrary weights, over the 
index $\alpha$) appropriate elementary motor fields of the form (see 
Mussa-Ivaldi and Giszter, 1990, in preparation): 

with $r_\alpha = \|x - x_\alpha\|$ and G a radial basis function such as the Gaussian. 

6 Summary: a Proposal for How the Brain 
Works 

The theory proposed in this paper consists of three main points: 

1. it assumes that the brain may use modules that approximate multi- 
variate functions and that can be synthesized from sparse examples 
as basic components for several information processing tasks. 

2. it proposes that these modules are realized in terms of HyperBF 
networks, of which a rigorous theory is now available. 

3. it shows how HyperBF networks can be implemented in terms of 
plausible biophysical mechanisms. 

The theory is in a sense a modern version of the grandmother neurons 
idea, made computationally plausible by eliminating the combinatorial ex- 
plosion in the number of required cells that was the main problem in the 
old idea. 

The proposal that much information processing in the brain is per- 
formed through modules that are similar to enhanced look-up tables is 
attractive for many reasons. It also promises to bring closer apparently 
orthogonal views, such as the immediate perception of Gibson and the rep- 
resentational theory of Marr, since almost iconic "snapshots" of the world 
may allow the synthesis of computational mechanisms completely equiv- 
alent to vision algorithms such as, say, structure-from-motion. The idea 
seems to change significantly the computational perspective on several vi- 
sion tasks. As a simple example, consider the different specific tasks of 
hyperacuity, as considered by the psychophysicists. The theory developed 
here would suggest that an appropriate module for the task, somewhat 
similar to a new "routine," may be synthesized by learning in the brain. 

Notice that the theory makes two independent claims: the first is that 
the brain can be explained in part in terms of approximation modules, 
the second is that these modules are of the HyperBF type. The second 
claim implies that the modules are an extension of look-up tables. Notice 
that there are schemes other than HyperBF that could be used to extend 
look-up tables. Notice also that multilayer Perceptrons, typically used in 
conjunction with back-propagation, can also be considered as approxima- 
tion schemes, albeit still without a convincing mathematical foundation. 
Unlike HyperBF networks, they cannot be interpreted as direct extensions 
of look-up tables (they are more similar to an extension of multidimensional 
Fourier series). 

The theory suggests that population coding (broadly tuned neurons 
combined linearly) is a consequence of extending a look-up table scheme - 
corresponding to interval coding - to yield interpolation (or more precisely 
approximation, since the examples may be noisy), that is generalization. 

The theory suggests some possibly interesting ideas about the evolution 
of intelligence. It also makes a number of predictions for physiology and 
psychophysics. More work is needed to specify sufficiently the details and 
some of the basic assumptions of the theory in order to make it useful to 
biologists. The next subsections deal with these last three points. 

6.1 Evolution of Intelligence: From Memory to Com- 
putation 

There is a duality between computation and memory. Given infinite re- 
sources the two points of view are equivalent: for instance, I could play 
chess by precomputing winning moves for every possible state of the chess- 
board! More to the point, notice that basic logical operations can be 
defined in terms of truth tables and that all boolean predicates can be 
represented in disjunctive normal form, i.e., as a look-up table. 
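
The equivalence between a boolean predicate and a look-up table can be made concrete with a short sketch. The function names (`truth_table`, `dnf_terms`, `evaluate_dnf`) and the three-input parity predicate are hypothetical choices for illustration only.

```python
from itertools import product

def truth_table(predicate, n_inputs):
    """Tabulate the predicate on every possible input: the pure 'memory' view."""
    return {bits: predicate(*bits)
            for bits in product((False, True), repeat=n_inputs)}

def dnf_terms(table):
    """Keep the input patterns for which the predicate is true.
    Each pattern is one conjunction of the disjunctive normal form."""
    return [bits for bits, value in table.items() if value]

def evaluate_dnf(terms, inputs):
    """OR over the stored conjunctions: true if the input matches any stored row."""
    return any(all(x == b for x, b in zip(inputs, term)) for term in terms)

def parity3(a, b, c):
    """Hypothetical example predicate: true for an odd number of true inputs."""
    return (a != b) != c

table = truth_table(parity3, 3)
terms = dnf_terms(table)
```

Evaluating the disjunctive normal form amounts to checking whether the input matches one of the stored rows, that is, to a table look-up.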

Given that the brain probably has a prodigious amount of memory 
and given that one can build powerful approximating look-up tables using 
techniques such as HyperBF, is it possible that part of intelligence may 
be built from a set of souped-up look-up tables? One advantage of this 
point of view is that it perhaps makes it easier to understand how intelligence 
may have evolved from simple associative reflexes. In more than one sense 
(biophysical and computational), HyperBF-like networks are a natural and 
rather straightforward development of very simple systems of a few neurons 
showing basic learning phenomena such as classical conditioning. 

6.2 Predictions and Remarks 

General Predictions 

• Computation, as generalization from examples, emerges from the su- 
perposition of receptive fields in a multidimensional input space. 
• Computation is performed by Gaussian receptive fields and their 
combination (through some approximation to multiplication), rather 
than by threshold functions. 

• The theory predicts the existence of low-dimensional feature-like cells 
and multidimensional Gaussian-like receptive fields, somewhat sim- 
ilar to template-like cells. One would expect to find cells that are 
tuned to low-level templates, like edges or corners, and others that 
are tuned to higher-level templates such as eyes and faces. In all 
cases, the prediction is that the activity of the cell should be graded 
and should depend in a Gaussian-like way on the distance of the input 
from the optimal template along any of the defining dimensions, a 
fact that could be tested experimentally on cortical cells. 

• The HyperBF scheme is a general-purpose circuit, used in the brain 
to synthesize modules that can be regarded as approximating look-up 
tables. If this point of view is correct, we expect the same basic kind 
of neural machinery to be replicated in different parts of the brain 
across different modalities (in particular in different cortical areas). 



• The "programming style" used by the brain in solving specific perceptual 
and motor problems is to synthesize appropriate architectures 
from modules of the type shown in figure 1 (a very simple architecture 
built from the basic module of figure 1 is shown in figure 4). 

Face Neurons 

1. Some of the face cells correspond to basis functions with centers in a 
high dimensional input space and are somewhat similar to prototypes 
or coarse "grandmother cells." 

2. They could be synthesized as the conjunctions of features with Gaussian- 
like distance from the prototype. 

3. Face cells are not detectors; often several may be active simultane- 
ously. The output of the network is a combination of several proto- 
types. 

4. From our preliminary experiments (Poggio and Edelman, 1990) the 
number of basis cells that are required per object is about 40-80 for 
the full viewing sphere, but much fewer (10-20) for each aspect (for 
instance frontal views). I conjecture that a similar estimate holds for 
faces. 
5. Inputs to the face cells are features such as eye positions, mouth 
position, hair color and so on. 

6. Eye-feature cells may themselves be the output of HyperBF networks 
specialized for eyes. 

Cerebellum 

1. The cerebellum is a set of approximation modules for learning to 
perform motor skills (both movements and posture). 

2. Its neurons are elements of a HyperBF network: the mossy fibers 
are the inputs, the granule cells correspond to the basis functions 
$G(x, x_i)$, the Purkinje cells correspond to the output units that 
summate the weighted activities of the basis units, whereas the climbing 
fibers carry the "teacher" signal $y_i$. 

3. The strength of the modifiable synapses between the parallel fibers 
and the Purkinje cells corresponds to the $c_\alpha$. 

4. Golgi cells may be involved in modifying during learning the center 
positions $t_\alpha$ and the norm-weights W. 

Motor Control 

1. The qualitative expectation is to find cells and circuits corresponding 
to the two stages shown in figure 6. Spinal cord neurons, according 
to very recent data by Bizzi et al. (1990), specify the limb's final 
position and configuration. 

6.3 The Future 

The proposal of this paper is just a rough sketch of a theory. Many details 
- some of them critical - need to be filled in. Some basic questions remain. 
For instance, how reasonable is the idea of supervised learning schemes? 
Or, to say it in a different and perhaps more constructive way, what are 
the systems that can be synthesized from building blocks that are just 
function approximation modules? And what types of tasks can be solved 
by systems of that type? 

On the biological side of the theory, the obvious next task is to develop 
detailed proposals for the circuitries underlying face recognition and motor 
control (including the circuitry of the cerebellum) that take into account 
up-to-date physiological and anatomical data. 
NOTES 

Section 1 

• Segmentation of an image in parts that are likely to correspond to separate 
objects is probably the most difficult problem in vision. Remember that 
already in the Perceptron book (Minsky and Papert, 1969) recognition-in- 
context was shown to be significantly harder than recognition of isolated 
patterns. We assume here that this problem has been "solved," at least 
to a reasonable extent. 

• The same basic machinery in the brain may be used for synthesizing many 
different, "small" learning modules, as components of many different sys- 
tems. This is very different from suggesting a single giant network that 
learns everything. 

Section 2 
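The relevant derivatives for optimization methods that need them are 

• for the $c_\alpha$: 

$$\frac{\partial H[f^*]}{\partial c_\alpha} = -2 \sum_{i=1}^{N} \Delta_i \, G(\|x_i - t_\alpha\|_W^2), \qquad (6)$$ 

• for the centers $t_\alpha$: 

$$\frac{\partial H[f^*]}{\partial t_\alpha} = 4 c_\alpha \sum_{i=1}^{N} \Delta_i \, G'(\|x_i - t_\alpha\|_W^2) \, W^T W (x_i - t_\alpha), \qquad (7)$$ 

• and for W: 

$$\frac{\partial H[f^*]}{\partial W} = -4 W \sum_{\alpha=1}^{n} c_\alpha \sum_{i=1}^{N} \Delta_i \, G'(\|x_i - t_\alpha\|_W^2) \, Q_{i,\alpha}, \qquad (8)$$ 

where $Q_{i,\alpha} = (x_i - t_\alpha)(x_i - t_\alpha)^T$ is a dyadic product and G' is the first 
derivative of G (for details see Poggio and Girosi, 1990a). 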
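
As an illustration of the derivative with respect to the coefficients (equation 6), the following sketch compares it with a finite-difference estimate. It rests on several assumptions not stated in these notes: H is taken to be the sum of squared errors, with the error on example i equal to the desired output minus the network output, G is the Gaussian G(r²) = exp(-r²), and W is diagonal. All names and data values are hypothetical.

```python
import math

def weighted_sq_dist(x, t, w):
    """Squared W-norm of x - t for a diagonal W with entries w (assumption)."""
    return sum((wk * (xk - tk)) ** 2 for xk, tk, wk in zip(x, t, w))

def G(r2):
    """Gaussian basis function of the squared weighted distance (assumption)."""
    return math.exp(-r2)

def network(x, c, centers, w):
    """Network output: weighted sum of basis units centered on t_alpha."""
    return sum(ca * G(weighted_sq_dist(x, ta, w)) for ca, ta in zip(c, centers))

def H(c, centers, w, xs, ys):
    """Sum of squared errors over the examples (assumed form of H)."""
    return sum((y - network(x, c, centers, w)) ** 2 for x, y in zip(xs, ys))

def grad_c(c, centers, w, xs, ys):
    """Equation 6: dH/dc_alpha = -2 * sum_i Delta_i * G(||x_i - t_alpha||_W^2)."""
    grads = []
    for ta in centers:
        total = 0.0
        for x, y in zip(xs, ys):
            delta = y - network(x, c, centers, w)
            total += delta * G(weighted_sq_dist(x, ta, w))
        grads.append(-2.0 * total)
    return grads

def numerical_grad_c(c, centers, w, xs, ys, eps=1e-6):
    """Central finite differences, used only to check grad_c."""
    grads = []
    for a in range(len(c)):
        up, down = c[:], c[:]
        up[a] += eps
        down[a] -= eps
        grads.append((H(up, centers, w, xs, ys) - H(down, centers, w, xs, ys)) / (2 * eps))
    return grads

# Hypothetical toy data: four 2D examples, two centers.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [0.0, 1.0, 1.0, 0.0]
centers = [(0.0, 0.0), (1.0, 1.0)]
w = [1.0, 0.5]
c = [0.3, -0.2]

analytic = grad_c(c, centers, w, xs, ys)
numeric = numerical_grad_c(c, centers, w, xs, ys)
```

The same pattern (analytic gradient checked against finite differences) could be applied to the centers and to W, at the cost of a little more bookkeeping.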



Section 3 

• There are many non-radial functions derived from our regularization 
formulation, such as tensor-product splines, that are factorizable. 

• I have assumed here that all centers have the same W. It is possible 
to have sets of different Green's functions, each set with its own W (see 
Poggio and Girosi, 1990a). 
• It is natural to imagine hierarchical architectures based on the HyperBF 
scheme: a multidimensional Gaussian "template" unit may be a "feature" 
input for another radial function (again because of the factorization prop- 
erty of the Gaussian). Of course, a whole HyperBF network may be one 
of the inputs to another HyperBF network. 

• I conjecture that equation 8 could be approximated by a Hebbian-like rule 
for the elements $w_k$ of the diagonal W such as 

$$w_k(t+1) = w_k(t) - \sum_{\alpha=1}^{n} c_\alpha \gamma \, (x_k(t) - (t_\alpha)_k) \, y_k(t), \qquad (9)$$ 

where y is the output of the upper layer of figure 1a, i.e., y = Wx, and $\gamma$ is 

$$\gamma = \Delta_i \, G'(\|x_i - t_\alpha\|_W^2), \qquad (10)$$ 

and i labels the i-th example. Such a Hebbian rule requires back-connections 
from later stages in the network to the upper layer - where W is updated 
- in order to broadcast quantities such as the error of the overall network 
relative to the i-th example and the derivative G' of the activation of the 
units. 

• The mechanisms and especially the connections needed to implement the 
learning equations or some equivalent scheme are an open question, in 
terms of biological plausibility. More work is needed. 

Section 4 

• The HyperBF scheme addresses only one part of the problem of shape- 
based object recognition, the variability of object appearance due to chang- 
ing viewpoint. The key issue of how to detect and identify image features 
that are stable for different illuminations and viewpoints is outside the 
scope of the network. 

• Notice that the HyperBF approach to recognition does not require as 
inputs the x,y coordinates of image features: other parameters of appro- 
priate features can also be used. 

In a similar vein, notice that the HyperBF network can provide, with the 
same centers (but different c), other parameters of the object, such as its 
pose, instead of simply a yes/no recognition signal. 
• Recognition of noisy and partially occluded objects, using realistic feature 
identification schemes, requires an extension of the scheme. A natural 
extension is based on the use of multiple lower-dimensional 
centers, corresponding to different subsets of detected features, instead 
of one 2N-dimensional center for each view in the example set. This 
corresponds to a set of networks capable of recognizing different parts of 
an object. It is equivalent to a set of networks each with a diagonal W 
with some zero entries in the diagonal, instead of one network with W 
with nonzero diagonal elements. 

• Not all features may always be labeled correctly. In general, one expects 
a significant "correspondence" problem. Possibly the easiest solution is 
to generate all reasonable sequences of labels for a given input vector and 
simply try them out on the network. This is of course equivalent to trying 
in parallel the given input on many networks each with a different labeling 
of its inputs. 

• An obvious use of these learning/approximation modules based on the 
HyperBF technique is a hierarchical composition of GRBF 
modules, in which the outputs of lower-level modules assigned to detect 
object parts and their relative disposition in space are combined to allow 
recognition of complex structured objects. Figure 4 is an example of this 
architecture. 
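
The brute-force remedy for the correspondence problem mentioned in these notes (trying every reasonable labeling of the detected features on the network) can be sketched as follows. This is only an illustration under simplifying assumptions: a single Gaussian basis unit stands in for the network, all permutations are treated as "reasonable," and the names `rbf_response`, `flatten` and `best_labeling` are hypothetical.

```python
import math
from itertools import permutations

def rbf_response(vector, center, sigma=1.0):
    """Response of one Gaussian basis unit to an input vector."""
    sq = sum((v - c) ** 2 for v, c in zip(vector, center))
    return math.exp(-sq / (2.0 * sigma ** 2))

def flatten(points):
    """Turn N labeled (x, y) features into one 2N-dimensional vector."""
    return [coord for point in points for coord in point]

def best_labeling(unlabeled_points, center, sigma=1.0):
    """Try every ordering of the features and keep the one the unit likes best."""
    return max(permutations(unlabeled_points),
               key=lambda order: rbf_response(flatten(order), center, sigma))

# Hypothetical stored view (the center) and the same features detected
# in an unknown order.
stored_view = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
center = flatten(stored_view)
detected = [(5.0, 6.0), (1.0, 2.0), (3.0, 4.0)]
labeling = best_labeling(detected, center)
```

The number of orderings grows factorially with the number of features, which is why the text restricts the search to "reasonable" label sequences or to parallel networks.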

Section 5 

Zipser and Andersen (1988) have presented intriguing simulations sug- 
gesting that a backpropagation network trained to solve the problem of 
converting visual stimuli in retinal coordinates to head centered coordi- 
nates generates receptive fields similar to the ones experimentally found 
in cortical area 7 of the monkey. We conjecture that Andersen's data may 
be better accounted for by a HyperBF network. For simplicity, let us 
consider the one dimensional version of the problem Zipser and Andersen 
propose is solved by neurons in area 7. The position of a spot of light on 
the retina is given as r; the eye position relative to the head, e, is also 
known. The problem is to compute the position of the spot of light relative 
to the head, i.e., h = r + e. Stated in these terms, the problem is 
computationally trivial and its solution simply requires the addition of the two 
inputs r and e. The situation is, however, more complicated due to the 
actual representation in which r and e are given. In the equation, r and e 
are represented as numbers. Zipser and Andersen assume, in accordance 
with physiology, a different representation: they assume that the position 
r of a spot of light is coded by the presence or absence of activity of one 
or more cells in a retinotopic array. From this point of view, the goal of 
the computation carried out by the network is to change representation 
from array representation to number representation. 

The simplest solution to the problem of changing from an array represen- 
tation to a number representation is the following. Assume that only one 
cell in the array f(x) is excited at any given position, i.e., $f(x) = \delta(r - x)$. 
Simplifying somewhat the situation assumed by Zipser and Andersen, but 
not altering it in any significant way, let us assume that e is represented 
directly as a number or a firing rate. The problem then is to convert the 
array representation $f(x) = \delta(r - x)$ for the retinal position into a number 
(or a firing rate) representation. Consider a linear unit that linearly 
summates all inputs with the "receptive field" w(x). The output I is given by 
$I = \int w(x) f(x)\,dx$. For $f(x) = \delta(x - r)$, the choice w(x) = x yields I = r. 
Thus a simple solution to our problem of converting an array representa- 
tion into a number representation only needs receptive fields that increase 
linearly with eccentricity (notice that w(x) = ax may also be acceptable; 
simply a monotonic dependence on x may be a sufficient approximation). 
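
A discrete version of this conversion can be sketched in a few lines. The sketch assumes a one-dimensional array with exactly one active cell and a receptive field that grows linearly with position; the names `array_code` and `linear_unit` and the numerical values are hypothetical.

```python
def array_code(r, n_cells):
    """Array representation: only the cell at position r is active."""
    return [1.0 if x == r else 0.0 for x in range(n_cells)]

def linear_unit(activity, receptive_field):
    """Linear unit: sum of the inputs weighted by the receptive field w(x)."""
    return sum(w * f for w, f in zip(receptive_field, activity))

n_cells = 11
# Receptive field increasing linearly with eccentricity: w(x) = x.
w = [float(x) for x in range(n_cells)]

r = 7                        # retinal position of the spot of light
f = array_code(r, n_cells)   # array representation of r
output = linear_unit(f, w)   # number representation of r

e = 2.5                      # eye position, already given as a number
h = output + e               # head-centered position
```

With w(x) = ax instead, the unit would return a scaled version of r, which is the sense in which a merely monotonic receptive field may be an acceptable approximation.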

If a Gaussian HyperBF network with a polynomial term of degree one is 
used to approximate the relation h = r + e from a set of input-output 
examples, some of the basis functions will be linear units such as the ones 
described above and some will be the product of 2D Gaussians representing 
the visual receptive fields and 2D Gaussians representing the eye position. 
These latter cells would probably account for the multiplicative property 
of the area 7 cells found by Andersen. We conjecture that other features 
of the cells could be replicated in a HyperBF simulation. 
Figure captions 

Fig. 1. (a) The basic learning module that - we conjecture - is used by the 
brain for a number of tasks. The module learns to approximate a multivariate 
function from a set of examples (i.e., a set of input-output pairs). (b) A HyperBF 
network equivalent to a module for approximating a scalar function of three 
variables from sparse and noisy data. The data, a set of points where the 
value of the function is known, can be considered as examples to be used during 
learning. The hidden units evaluate the function $G(x; t_\alpha)$, and a fixed, nonlinear, 
invertible function may be present after the summation. The units are in general 
fewer than the number of examples. The parameters that are determined during 
learning are the coefficients $c_\alpha$, the centers $t_\alpha$ and the norm-weights W. In 
the radial case $G = G(\|x - t_\alpha\|_W^2)$ and the hidden units simply compute the 
radial basis functions G at the "centers" $t_\alpha$. The radial basis functions may be 
regarded as matching the input vectors against the "templates" or "prototypes" 
that correspond to the centers (consider, for instance, a radial Gaussian around 
its center, which is a point in the n-dimensional space of inputs). There may be 
also connections computing the polynomial term of 1: constant and linear terms 
(the dotted lines in figure 1b) may be expected in most cases. 

Fig. 2. A three-dimensional radial Gaussian implemented by multiplying 
two-dimensional Gaussian and one-dimensional Gaussian receptive fields. The latter 
two functions are synthesized directly by appropriately weighted connections 
from the sensor arrays, as neural receptive fields are usually thought to arise. 
Notice that they transduce the implicit position of stimuli in the sensor array 
into a number (the activity of the unit). They thus serve the dual purpose of 
providing the required "number" representation from the activity of the sen- 
sor array and of computing a Gaussian function. 2D Gaussians acting on a 
retinotopic map can be regarded as representing 2D "features," while the radial 
basis function represents the "template" resulting from the conjunction of those 
lower-dimensional features. 
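
The factorization described in the caption of Fig. 2 can be checked numerically with a minimal sketch. It assumes an isotropic Gaussian with a common width `sigma`; the function names and the numerical values are hypothetical.

```python
import math

def gaussian(sq_dist, sigma):
    """Gaussian of a squared distance."""
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def radial_gaussian_3d(x, t, sigma):
    """Three-dimensional radial Gaussian 'template' centered on t."""
    return gaussian(sum((a - b) ** 2 for a, b in zip(x, t)), sigma)

def receptive_field_2d(xy, t_xy, sigma):
    """Two-dimensional Gaussian receptive field (a 2D 'feature')."""
    return gaussian(sum((a - b) ** 2 for a, b in zip(xy, t_xy)), sigma)

def receptive_field_1d(z, t_z, sigma):
    """One-dimensional Gaussian receptive field."""
    return gaussian((z - t_z) ** 2, sigma)

x = (0.3, -1.2, 0.7)   # hypothetical input
t = (0.0, -1.0, 1.0)   # hypothetical center
sigma = 0.8

template = radial_gaussian_3d(x, t, sigma)
product_of_fields = (receptive_field_2d(x[:2], t[:2], sigma)
                     * receptive_field_1d(x[2], t[2], sigma))
```

The two quantities agree up to rounding because the exponential of a sum of squared differences is the product of the exponentials of its parts, so no explicit norm is needed once the lower-dimensional fields are available.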

Fig. 3. (a) The HyperBF network proposed for the recognition of a 3D object 
from any of its perspective views (Poggio and Edelman, 1990). The network 
attempts to map any view (as defined in the text) into a standard view, 
arbitrarily chosen. The norm of the difference between the output vector f and 
the standard view s is thresholded to yield a 0, 1 answer. The 2N inputs 
accommodate the input vector v representing an arbitrary view. Each of the K 
radial basis functions is initially centered on one of a subset of the M views used 
to synthesize the system (K < M). During training each of the M inputs in 
the training set is associated with the desired output, i.e., the standard view 
s. Fig. 3(b) shows a completely equivalent interpretation of (a) for the special 
case of Gaussian radial basis functions. Gaussian functions can be synthesized 
by multiplying the outputs of two-dimensional Gaussian receptive fields that 
"look" at the retinotopic map of the object point features. The solid circles 
in the image plane represent the 2D Gaussians associated with the first radial 
basis function, which represents the first view of the object. The dotted circles 
represent the 2D receptive fields that synthesize the Gaussian radial function 
associated with another view. The 2D Gaussian receptive fields transduce po- 
sitions of features, represented implicitly as activity in a retinotopic array, and 
their product "computes" the radial function without the need of calculating 
norms and exponentials explicitly. From Poggio and Girosi (1990b). 

Fig. 4. A hierarchical scheme in which HyperBF modules are inputs to another 
HyperBF module. As an example, a scheme of this type may be used for 3D 
object recognition in the general case of spurious and missing features. Instead 
of encoding all n features one encodes only subsets of dimensions d, where d < n. 
The input to each module in the first row is a different set of features of the 
object; the output is a value between 0 and 1 that indicates the degree of certainty 
that the input is the sought object. The last module is a decision module that 
integrates the various inputs. Notice that all modules could be synthesized by 
learning through independent sets of examples. 

Fig. 5. (a) A sketch of the neurons of the cerebellum and their connections. 
In our conjecture, these would be the basic elements of a HyperBF network: the 
mossy fibers are the inputs, the granule cells correspond to the various centers 
and basis functions $G(x, x_i)$, the Purkinje cells correspond to the output units 
that summate the weighted activities of the basis units, whereas the climbing 
fibers carry the "teacher" signal $y_i$. The strength of the synapses between the 
parallel fibers and the Purkinje cells would correspond to the $c_\alpha$. (b) The 
corresponding HyperBF network is shown on the right: it has two basis functions 
corresponding to the two granule cells on the left and two output summation 
units corresponding to the two Purkinje cells on the left. 

Fig. 6. Two problems in motor control: (a) determining the trajectory x(t) 
from a small set of points $(x_i, t_i)$ on the desired trajectory and (b) computing 
the field of muscle forces for each of the points on the trajectory. The figure 
suggests that two different HyperBF modules may be used to perform both 
tasks. In (a) a HyperBF module approximates the trajectory from the sparse 
points by superimposing Gaussian distributions with the appropriate weights 
in such a way as to satisfy some minimum-jerk-like principle. In (b) a module of 
the HyperBF type has been synthesized during development and continuously 
adapted to generate the appropriate field of forces for each equilibrium position 
x. It is similar to an approximating look-up table. A behaviour of the look-up 
table type was suggested by Bizzi because of very recent experimental data (see 
Bizzi et al., 1990). 